Patterns
Previous: <Format=>Format> * Next: <Matching=>Matching> * Up: <Top=>!Root>
#Wrap on
{fH3}Patterns{f}
The patterns in the input are written using an extended
set of regular expressions. These are:
#Indent +4
#Indent
{fEmphasis}x{f}
#Indent +4
match the character {fEmphasis}x{f}
#Indent
{fEmphasis}.{f}
#Indent +4
any character (byte) except newline
#Indent
{fEmphasis}[xyz]{f}
#Indent +4
a "character class"; in this case, the pattern
matches either an {fEmphasis}x{f}, a {fEmphasis}y{f}, or a {fEmphasis}z{f}
#Indent
{fEmphasis}[abj-oZ]{f}
#Indent +4
a "character class" with a range in it; matches
an {fEmphasis}a{f}, a {fEmphasis}b{f}, any letter from {fEmphasis}j{f} through {fEmphasis}o{f},
or a {fEmphasis}Z{f}
#Indent
{fEmphasis}[^A-Z]{f}
#Indent +4
a "negated character class", i.e., any character
but those in the class. In this case, any
character EXCEPT an uppercase letter.
#Indent
{fEmphasis}[^A-Z\\n]{f}
#Indent +4
any character EXCEPT an uppercase letter or
a newline
#Indent
{fEmphasis}{fStrong}r{f}\*{f}
#Indent +4
zero or more {fStrong}r{f}'s, where {fStrong}r{f} is any regular expression
#Indent
{fEmphasis}{fStrong}r{f}+{f}
#Indent +4
one or more {fStrong}r{f}'s
#Indent
{fEmphasis}{fStrong}r{f}?{f}
#Indent +4
zero or one {fStrong}r{f}'s (that is, "an optional {fStrong}r{f}")
#Indent
{fEmphasis}{fStrong}r{f}\{2,5\}{f}
#Indent +4
anywhere from two to five {fStrong}r{f}'s
#Indent
{fEmphasis}{fStrong}r{f}\{2,\}{f}
#Indent +4
two or more {fStrong}r{f}'s
#Indent
{fEmphasis}{fStrong}r{f}\{4\}{f}
#Indent +4
exactly 4 {fStrong}r{f}'s
#Indent
{fEmphasis}\{{fStrong}name{f}\}{f}
#Indent +4
the expansion of the "{fStrong}name{f}" definition
(see above)
#Indent
{fEmphasis}"[xyz]\\"foo"{f}
#Indent +4
the literal string: {fEmphasis}[xyz]"foo{f}
#Indent
{fEmphasis}\\{fStrong}x{f}{f}
#Indent +4
if {fStrong}x{f} is an {fEmphasis}a{f}, {fEmphasis}b{f}, {fEmphasis}f{f}, {fEmphasis}n{f}, {fEmphasis}r{f}, {fEmphasis}t{f}, or {fEmphasis}v{f},
then the ANSI-C interpretation of \\{fStrong}x{f}.
Otherwise, a literal {fEmphasis}{fStrong}x{f}{f} (used to escape
operators such as {fEmphasis}\*{f})
#Indent
{fEmphasis}\\0{f}
#Indent +4
a NUL character (ASCII code 0)
#Indent
{fEmphasis}\\123{f}
#Indent +4
the character with octal value 123
#Indent
{fEmphasis}\\x2a{f}
#Indent +4
the character with hexadecimal value {fCode}2a{f}
#Indent
{fEmphasis}({fStrong}r{f}){f}
#Indent +4
match an {fStrong}r{f}; parentheses are used to override
precedence (see below)
#Indent
{fEmphasis}{fStrong}r{f}{fStrong}s{f}{f}
#Indent +4
the regular expression {fStrong}r{f} followed by the
regular expression {fStrong}s{f}; called "concatenation"
#Indent
{fEmphasis}{fStrong}r{f}|{fStrong}s{f}{f}
#Indent +4
either an {fStrong}r{f} or an {fStrong}s{f}
#Indent
{fEmphasis}{fStrong}r{f}\/{fStrong}s{f}{f}
#Indent +4
an {fStrong}r{f} but only if it is followed by an {fStrong}s{f}. The text
matched by {fStrong}s{f} is included when determining whether this rule is
the {fUnderline}longest match{f}, but is then returned to the input before
the action is executed. So the action only sees the text matched
by {fStrong}r{f}. This type of pattern is called {fUnderline}trailing context{f}.
(There are some combinations of {fEmphasis}{fStrong}r{f}\/{fStrong}s{f}{f} that {fCode}flex{f}
cannot match correctly; see notes in the Deficiencies \/ Bugs section
below regarding "dangerous trailing context".)
#Indent
{fEmphasis}^{fStrong}r{f}{f}
#Indent +4
an {fStrong}r{f}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).
#Indent
{fEmphasis}{fStrong}r{f}${f}
#Indent +4
an {fStrong}r{f}, but only at the end of a line (i.e., just
before a newline). Equivalent to "{fStrong}r{f}\/\\n".
Note that flex's notion of "newline" is exactly
whatever the C compiler used to compile flex
interprets '\\n' as; in particular, on some DOS
systems you must either filter out \\r's in the
input yourself, or explicitly use {fStrong}r{f}\/\\r\\n for "r$".
#Indent
{fEmphasis}<{fStrong}s{f}>{fStrong}r{f}{f}
#Indent +4
an {fStrong}r{f}, but only in start condition {fStrong}s{f} (see
below for discussion of start conditions)
#Indent
{fEmphasis}<{fStrong}s1{f},{fStrong}s2{f},{fStrong}s3{f}>{fStrong}r{f}{f}
#Indent +4
same, but in any of start conditions {fStrong}s1{f},
{fStrong}s2{f}, or {fStrong}s3{f}
#Indent
{fEmphasis}<\*>{fStrong}r{f}{f}
#Indent +4
an {fStrong}r{f} in any start condition, even an exclusive one.
#Indent
{fEmphasis}<<EOF>>{f}
#Indent +4
an end-of-file
#Indent
{fEmphasis}<{fStrong}s1{f},{fStrong}s2{f}><<EOF>>{f}
#Indent +4
an end-of-file when in start condition {fStrong}s1{f} or {fStrong}s2{f}
#Indent
Note that inside of a character class, all regular
expression operators lose their special meaning except escape
('\\') and the character class operators, '-', ']', and, at
the beginning of the class, '^'.
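The trailing context operator described above has no direct counterpart
in the POSIX {fCode}regcomp(){f} interface, but its visible effect can be
approximated. The sketch below is an illustration only, not how flex
implements the feature: it compiles "^({fStrong}r{f})({fStrong}s{f})" and reports only the
extent of the first group, so {fStrong}s{f} must be present but is not counted as
matched text. The function name {fCode}trailing_context_len{f} is hypothetical,
and competition between several rules for the longest match is not
modelled.

```c
#include <regex.h>
#include <stdio.h>

/* Approximates the flex pattern r/s at the start of `input`.
 * Returns the length of the text matched by r alone, or -1 if
 * r followed by s does not match there (or the pattern is invalid). */
int trailing_context_len(const char *r, const char *s, const char *input)
{
    char pattern[256];
    regex_t re;
    regmatch_t groups[2];   /* groups[0]: whole match, groups[1]: r */
    int len = -1;

    /* Build ^(r)(s); give up if it does not fit in this sketch's buffer. */
    if (snprintf(pattern, sizeof pattern, "^(%s)(%s)", r, s)
            >= (int)sizeof pattern)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    /* s takes part in the match, but only r's extent is reported. */
    if (regexec(&re, input, 2, groups, 0) == 0)
        len = (int)(groups[1].rm_eo - groups[1].rm_so);
    regfree(&re);
    return len;
}
```

Under these assumptions, matching "foo" with trailing context "bar"
against the input "foobarbaz" should report a length of 3, while the
input "foobaz" should report no match. Patterns in which {fStrong}r{f} can end
with text that {fStrong}s{f} can also begin with may behave differently here than
in flex.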
The regular expressions listed above are grouped according
to precedence, from highest precedence at the top to
lowest at the bottom. Those grouped together have equal
precedence. For example,
#Wrap off
#fCode
foo|bar\*
#f
#Wrap on
is the same as
#Wrap off
#fCode
(foo)|(ba(r\*))
#f
#Wrap on
since the '\*' operator has higher precedence than
concatenation, and concatenation higher than alternation ('|').
This pattern therefore matches {fEmphasis}either{f} the string "foo" {fEmphasis}or{f}
the string "ba" followed by zero-or-more r's. To match
"foo" or zero-or-more "bar"'s, use:
#Wrap off
#fCode
foo|(bar)\*
#f
#Wrap on
and to match zero-or-more "foo"'s-or-"bar"'s:
#Wrap off
#fCode
(foo|bar)\*
#f
#Wrap on
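The grouping rules above can be checked outside of flex. The sketch
below assumes a system that provides POSIX {fCode}regcomp(){f}, whose extended
regular expressions use the same relative precedence for '\*',
concatenation and '|' as described here. It is not flex itself, and the
helper name {fCode}matches_whole{f} is hypothetical. Each pattern is wrapped
as "^(pattern)$" so that only whole-string matches count.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Returns true if `text` matches `pattern` in its entirety.
 * The pattern is wrapped as ^(pattern)$ so that partial matches
 * do not hide how the operators group. */
bool matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    bool ok;

    /* Give up if the wrapped pattern does not fit in the buffer. */
    if (snprintf(anchored, sizeof anchored, "^(%s)$", pattern)
            >= (int)sizeof anchored)
        return false;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return false;               /* pattern failed to compile */
    ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

With this helper, "foo|bar\*" should accept "barrr" but reject
"barbar", "foo|(bar)\*" should accept "barbar" but reject "barrr", and
"(foo|bar)\*" should accept "foobarfoo".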
In addition to characters and ranges of characters,
character classes can also contain character class
{fUnderline}expressions{f}. These are expressions enclosed inside {fEmphasis}[:{f} and {fEmphasis}:]{f}
delimiters (which themselves must appear between the '['
and ']' of the character class; other elements may occur
inside the character class, too). The valid expressions
are:
#Wrap off
#fCode
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
#f
#Wrap on
These expressions all designate a set of characters
equivalent to the corresponding standard C {fEmphasis}isXXX{f} function. For
example, {fEmphasis}[:alnum:]{f} designates those characters for which
{fEmphasis}isalnum(){f} returns true - i.e., any alphabetic or numeric character.
Some systems don't provide {fEmphasis}isblank(){f}, so flex defines
{fEmphasis}[:blank:]{f} as a blank or a tab.
For example, the following character classes are all
equivalent:
#Wrap off
#fCode
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:]0-9]
[a-zA-Z0-9]
#f
#Wrap on
If your scanner is case-insensitive (the {fEmphasis}-i{f} flag), then
{fEmphasis}[:upper:]{f} and {fEmphasis}[:lower:]{f} are equivalent to {fEmphasis}[:alpha:]{f}.
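The correspondence between {fEmphasis}[:alnum:]{f} and {fEmphasis}isalnum(){f} stated above can
be checked directly in C. The sketch below assumes an ASCII system and
the default "C" locale (no call to {fCode}setlocale(){f}); in other locales
{fEmphasis}isalnum(){f} may accept additional characters. The function names are
hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/* True if c belongs to [a-zA-Z0-9], written out by hand.
 * Assumes the letters are contiguous, as they are in ASCII. */
bool in_explicit_alnum(unsigned char c)
{
    return (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9');
}

/* Counts the byte values on which isalnum() and the explicit
 * class [a-zA-Z0-9] disagree. */
int count_alnum_mismatches(void)
{
    int mismatches = 0;
    for (int c = 0; c < 256; c++) {
        if ((isalnum(c) != 0) != in_explicit_alnum((unsigned char)c))
            mismatches++;
    }
    return mismatches;
}

/* [:blank:] as the text defines it: a blank or a tab. */
bool in_blank(unsigned char c)
{
    return c == ' ' || c == '\t';
}
```

Under the stated assumptions {fCode}count_alnum_mismatches(){f} is expected to
return 0, which is why the character classes listed above can be
treated as equivalent.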
Some notes on patterns:
#Indent +4
- A negated character class such as the example
"[^A-Z]" above {fEmphasis}will match a newline{f} unless "\\n" (or an
equivalent escape sequence) is one of the
characters explicitly present in the negated character
class (e.g., "[^A-Z\\n]"). This is unlike how many
other regular expression tools treat negated
character classes, but unfortunately the inconsistency
is historically entrenched. Matching newlines
means that a pattern like [^"]\* can match the
entire input unless there's another quote in the
input.
- A rule can have at most one instance of trailing
context (the '\/' operator or the '$' operator).
The start condition, '^', and "<<EOF>>" patterns
can only occur at the beginning of a pattern and,
like '\/' and '$', cannot be grouped
inside parentheses. A '^' which does not occur at
the beginning of a rule or a '$' which does not
occur at the end of a rule loses its special
properties and is treated as a normal character.
The following are illegal:
#Wrap off
#fCode
foo\/bar$
<sc1>foo<sc2>bar
#f
#Wrap on
Note that the first of these can be written
"foo\/bar\\n".
The following will result in '$' or '^' being
treated as a normal character:
#Wrap off
#fCode
foo|(bar$)
foo|^bar
#f
#Wrap on
If what's wanted is a "foo" or a
bar-followed-by-a-newline, the following could be used (the special
'|' action is explained below):
#Wrap off
#fCode
foo |
bar$ \/\* action goes here \*\/
#f
#Wrap on
A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
#Indent
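The first note above, on negated character classes matching a newline,
can be observed with POSIX {fCode}regcomp(){f} as well: when the {fCode}REG_NEWLINE{f}
flag is not given, a negated bracket expression matches a newline
there too. The sketch below is an illustration under that assumption,
not flex itself, and the helper name {fCode}prefix_match_len{f} is hypothetical.

```c
#include <regex.h>

/* Returns the length of the match of `pattern` that begins at the
 * first character of `input`, or -1 if there is none. REG_NEWLINE is
 * deliberately not set, so a newline is an ordinary character. */
int prefix_match_len(const char *pattern, const char *input)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;                  /* pattern failed to compile */
    if (regexec(&re, input, 1, &m, 0) == 0 && m.rm_so == 0)
        len = (int)m.rm_eo;
    regfree(&re);
    return len;
}
```

For an input consisting of "one", a newline, "two", a double quote and
further text, the pattern [^"]\* should consume seven characters,
running straight across the newline, while [^"\\n]\* should stop after
the three characters of "one".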